Source scheduler failure metric labels from error categorization, remove TrackedErrorRegexes #4980

Merged: dejanzele merged 2 commits into armadaproject:master from dejanzele:rewire-scheduler-error-metric on Jun 26, 2026

@dejanzele dejanzele commented Jun 25, 2026

What Armada exposes now

armada_scheduler_job_error_classification_by_queue and _by_node now label failures with the semantic category from error categorization, read off the Error proto (FailureCategory/FailureSubcategory) instead of a regex match against the message. The metric names and label sets are unchanged. Only the label values change.

armada_scheduler_job_error_classification_by_queue{queue="analytics", category="user_error", subcategory=""} 2
armada_scheduler_job_error_classification_by_queue{queue="analytics", category="internal",   subcategory="lease-expired"} 1
armada_scheduler_job_error_classification_by_node{node="worker-1", cluster="c1", category="user_error", subcategory=""} 2

category was the Error.Reason type (podError, leaseExpired, ...) and is now the semantic category, so dashboards filtering on the old values need updating. subcategory was the first matching regex (empty in practice) and is now FailureSubcategory.
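
The label sourcing described above can be sketched roughly as follows. This is a stdlib-only illustration under stated assumptions, not Armada's code: `runError` is a hypothetical stand-in for the `armadaevents.Error` proto (its two fields are modelled as plain strings), its getters imitate the nil-safe getters that Go protobuf code generation produces, and a map stands in for the Prometheus counter vector.

```go
package main

import "fmt"

// runError is a hypothetical stand-in for the armadaevents.Error proto.
// Only the two fields named in this PR are modelled, as plain strings.
type runError struct {
	FailureCategory    string
	FailureSubcategory string
}

// GetFailureCategory imitates a generated protobuf getter: it is safe to
// call on a nil receiver and returns the zero value in that case.
func (e *runError) GetFailureCategory() string {
	if e == nil {
		return ""
	}
	return e.FailureCategory
}

// GetFailureSubcategory is nil-safe in the same way.
func (e *runError) GetFailureSubcategory() string {
	if e == nil {
		return ""
	}
	return e.FailureSubcategory
}

// queueLabels mirrors the label set of the by-queue metric.
type queueLabels struct {
	Queue, Category, Subcategory string
}

// errorCounter is a hypothetical stand-in for a Prometheus counter vector.
type errorCounter map[queueLabels]int

// recordFailure reads both labels straight off the error and increments the
// matching counter. No regex matching against the message is involved.
func (c errorCounter) recordFailure(queue string, err *runError) {
	labels := queueLabels{
		Queue:       queue,
		Category:    err.GetFailureCategory(),
		Subcategory: err.GetFailureSubcategory(),
	}
	c[labels]++
}

func main() {
	counts := errorCounter{}
	counts.recordFailure("analytics", &runError{FailureCategory: "user_error"})
	counts.recordFailure("analytics", &runError{FailureCategory: "user_error"})
	counts.recordFailure("analytics", &runError{FailureCategory: "internal", FailureSubcategory: "lease-expired"})
	for labels, n := range counts {
		fmt.Printf("%+v %d\n", labels, n)
	}
}
```

In this sketch a missing error still produces a series, with empty `category` and `subcategory` values, because the getters tolerate a nil receiver.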

This replaces trackedErrorRegexes, which is removed along with it: the scheduler.metrics.trackedErrorRegexes config and the errorTypeAndMessageFromError / errorRegexes plumbing in metrics.New. It was never set in-repo, so a deployment still setting it just logs an "unused key" warning.

Validation

End to end on a Helm-deployed stack: failing jobs classified by the executor (user_error, oom) and a killed-executor run (internal/lease-expired) all landed on the metric with the expected labels and counts. max-runs-exceeded and job-rejected are job-level rather than run-level errors, so they do not surface in this metric.

The metric names and label sets are unchanged, so the existing PrometheusRule keeps evaluating as before. Pre-existing bugs in that file are fixed separately in #4983.

@dejanzele dejanzele force-pushed the rewire-scheduler-error-metric branch from 5047645 to 5eeb998 Compare June 25, 2026 12:58
greptile-apps Bot commented Jun 25, 2026

Greptile Summary

This PR replaces regex-based error classification with direct reads from FailureCategory/FailureSubcategory proto fields on armadaevents.Error, removing TrackedErrorRegexes from the config and the associated compilation/matching logic in metrics.New and jobStateMetrics.

  • metrics.New is now infallible (returns *Metrics directly instead of (*Metrics, error)), and newJobStateMetrics drops the errorRegexes parameter entirely.
  • Label values for armada_scheduler_job_error_classification_by_queue/node change semantically (e.g., podError → user_error, empty subcategory → oom); dashboards filtering on the old values will need updating.
  • TestCategoriseErrors is updated to exercise the new proto-sourced labels (infrastructure/oom) end-to-end.
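
The signature change in the first bullet follows from where the error used to originate: compiling configured patterns can fail, while reading proto fields cannot. A rough sketch of the two shapes, in which `metrics`, `newWithRegexes`, and `newMetrics` are hypothetical names and not the real `metrics.New`:

```go
package main

import (
	"fmt"
	"regexp"
)

// metrics is a hypothetical, stripped-down stand-in for the scheduler's
// metrics object.
type metrics struct {
	errorRegexes []*regexp.Regexp // only populated by the old-style constructor
}

// newWithRegexes sketches the old shape: every configured pattern must
// compile, so construction has to be able to return an error.
func newWithRegexes(patterns []string) (*metrics, error) {
	compiled := make([]*regexp.Regexp, 0, len(patterns))
	for _, p := range patterns {
		re, err := regexp.Compile(p)
		if err != nil {
			return nil, fmt.Errorf("compiling tracked error regex %q: %w", p, err)
		}
		compiled = append(compiled, re)
	}
	return &metrics{errorRegexes: compiled}, nil
}

// newMetrics sketches the new shape: with nothing left to compile, the
// constructor has no failure mode and returns the value directly.
func newMetrics() *metrics {
	return &metrics{}
}

func main() {
	if _, err := newWithRegexes([]string{"("}); err != nil {
		fmt.Println("old-style constructor can fail:", err)
	}
	fmt.Println("new-style constructor returned a value:", newMetrics() != nil)
}
```

Call sites shrink accordingly: there is no error to check or propagate after construction.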

Confidence Score: 5/5

Safe to merge — the change is a straightforward removal of dead regex machinery in favour of reading two proto fields that are already populated by the executor.

The only observable side-effect is a change in metric label values (e.g. podError to user_error), which is explicitly documented. The path from failure detection to counter increment is now a single getter call on a nil-safe proto receiver, and call sites in schedulerapp.go and tests are consistently updated. No new error paths are introduced.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| internal/scheduler/metrics/state_metrics.go | Removes errorRegexes field and failedCategoryAndSubCategoryFromJob/errorTypeAndMessageFromError helpers; calls proto getters directly at the failure-recording site. Clean simplification with no logic regressions. |
| internal/scheduler/metrics/metrics.go | New() is now infallible: it drops regex compilation and the returned error; signature and call sites are updated consistently. |
| internal/scheduler/configuration/configuration.go | Removes TrackedErrorRegexes field from MetricsConfig; deployments that still set this key will get an unused-key warning, not an error. |
| internal/scheduler/metrics/state_metrics_test.go | Test signatures updated and TestCategoriseErrors rewritten to use FailureCategory/FailureSubcategory proto fields; TestReset and TestDisable retain pre-existing (harmless) use of state label values for the error-counter slots. |
| internal/scheduler/schedulerapp.go | Call site updated to match new infallible New() signature; TrackedErrorRegexes argument removed. Clean change. |
| internal/scheduler/scheduler_test.go | Package-level schedulerMetrics variable updated to match the new infallible New() signature. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Job Run Fails] --> B[ReportStateTransitions]
    B --> C{jst.Failed?}
    C -- Yes --> D[Lookup jobRunError by run ID]
    D --> E[Read FailureCategory and FailureSubcategory from proto]
    E --> F[Increment jobErrorsByQueue counter]
    E --> G[Increment jobErrorsByNode counter]

    subgraph OLD ["Removed - TrackedErrorRegexes path"]
        H[errorTypeAndMessageFromError] --> I[Loop over compiled regexes]
        I --> J[Return type and first matched regex string]
    end

    style OLD fill:#ffcccc,stroke:#cc0000

Reviews (11): Last reviewed commit: "Merge branch 'master' into rewire-schedu..."

Comment thread internal/scheduler/metrics/state_metrics.go Outdated
@dejanzele dejanzele force-pushed the rewire-scheduler-error-metric branch from 5eeb998 to 91177c3 Compare June 25, 2026 13:06
@datadog-armadaproject

Pipelines

⚠️ Warnings

🚦 1 Pipeline job failed

CI | test / Golang Integration Tests

🔗 Commit SHA: 91177c3

@dejanzele dejanzele force-pushed the rewire-scheduler-error-metric branch 7 times, most recently from 4e2715f to b0dd7f9 Compare June 26, 2026 12:14
…lureSubcategory and remove TrackedErrorRegexes

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
@dejanzele dejanzele force-pushed the rewire-scheduler-error-metric branch from b0dd7f9 to 3af6b9f Compare June 26, 2026 13:01
mergify Bot commented Jun 26, 2026

Tick the box to add this pull request to the merge queue (same as @mergifyio queue).

  • Queue this pull request

@dejanzele dejanzele enabled auto-merge (squash) June 26, 2026 15:18
@dejanzele dejanzele merged commit 1eface2 into armadaproject:master Jun 26, 2026
17 checks passed